An Investigation of Memorization Risk in Healthcare Foundation Models
Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups.
Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data
Grounding language in perception and action is a key challenge when building situated agents that can interact with humans, or other agents, via language. In the past, addressing this challenge has required manually designing the language grounding or curating massive datasets that associate language with the environment. We propose Ground-Compose-Reinforce, an end-to-end, neurosymbolic framework for training RL agents directly from high-level task specifications--without manually designed reward functions or other domain-specific oracles, and without massive datasets. These task specifications take the form of Reward Machines, automata-based representations that capture high-level task structure and are in some cases autoformalizable from natural language. Critically, we show that Reward Machines can be grounded using limited data by exploiting compositionality. Experiments in a custom Meta-World domain with only 350 labelled pretraining trajectories show that our framework faithfully elicits complex behaviours from high-level specifications--including behaviours that never appear in pretraining--while non-compositional approaches fail.
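To make the task-specification format concrete, a Reward Machine can be pictured as a finite automaton whose transitions fire on symbolic propositions and emit rewards. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `RewardMachine` class, the state names, and the propositions `grasped` and `at_goal` are all hypothetical, and the proposition sets are supplied by hand rather than produced by a learned grounding.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Minimal reward machine: (state, proposition) -> (next_state, reward)."""
    initial: str
    terminal: set
    # Hypothetical transition table; unlisted pairs self-loop with zero reward.
    delta: dict = field(default_factory=dict)

    def step(self, state, true_props):
        """Advance on the set of propositions that currently hold."""
        for prop in sorted(true_props):
            if (state, prop) in self.delta:
                return self.delta[(state, prop)]
        return state, 0.0

    def run(self, labelled_trajectory):
        """Return the final state and total reward over a sequence of proposition sets."""
        state, total = self.initial, 0.0
        for true_props in labelled_trajectory:
            if state in self.terminal:
                break
            state, reward = self.step(state, true_props)
            total += reward
        return state, total


# Hypothetical task: "grasp the object, then bring it to the goal".
rm = RewardMachine(
    initial="u0",
    terminal={"u_done"},
    delta={
        ("u0", "grasped"): ("u1", 0.0),
        ("u1", "at_goal"): ("u_done", 1.0),
    },
)
```

Calling `rm.run([set(), {"grasped"}, {"at_goal"}])` walks the automaton through both sub-goals in order; reaching the goal before grasping leaves the machine in its initial state with no reward, which is how the automaton encodes task structure.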
Learn and Ensemble Bridge Adapters for Multi-domain Task Incremental Learning
Multi-domain task incremental learning (MTIL) requires models to master domain-specific expertise while preserving generalization capabilities. Inspired by human lifelong learning [1, 2], which relies on revisiting, aligning, and integrating past experiences, we propose a Learning and Ensembling Bridge Adapters (LEBA) framework. Specifically, to facilitate cohesive knowledge transfer across domains, we propose a continuous-domain bridge adaptation module that leverages the distribution transfer capabilities of the Schrödinger bridge for stable progressive learning. To strengthen memory consolidation, we further propose a progressive knowledge ensemble strategy that revisits past task representations via a diffusion model and dynamically integrates historical adapters. For efficiency, LEBA maintains a compact adapter pool through similarity-based selection and employs learnable weights to align replayed samples with current task semantics. Together, these components effectively mitigate catastrophic forgetting and enhance generalization across tasks.
d6d26053b977f8c589669fd201615119-Paper-Conference.pdf
Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data point to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve their scalability with an efficient gradient projection strategy called LOGRA that leverages the gradient structure in backpropagation. We then provide a theoretical motivation of gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LOGIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LOGRA achieves competitive accuracy against more expensive baselines while showing up to 6,500×/5× improvements in compute/memory efficiency in influence computations as well as a 2× speed-up in gradient statistics logging when applied to Llama3-8B-Instruct and the 1B-token subset of the OpenWebText dataset.
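One way to picture gradient projection for influence functions: in a linear layer, the per-example weight gradient is the outer product of the backpropagated output gradient and the input activation, so each factor can be projected separately and the full gradient is never materialized. The sketch below illustrates that idea with NumPy. It is not the LOGRA or LOGIX code: the random Gaussian projections, the layer sizes, the damping value, and the use of the empirical second moment of projected gradients as the Hessian approximation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 64, 32      # layer dimensions (illustrative)
k_in, k_out = 8, 4        # projected dimensions (illustrative)

# Assumed random Gaussian projections for the input and output sides.
P_in = rng.standard_normal((k_in, n_in)) / np.sqrt(k_in)
P_out = rng.standard_normal((k_out, n_out)) / np.sqrt(k_out)


def projected_grad(x, dy):
    """Project the per-example weight gradient dy x^T without forming it.

    x:  input activation, shape (n_in,)
    dy: gradient of the loss w.r.t. the layer output, shape (n_out,)
    Returns a flat (k_out * k_in,) vector equal to P_out (dy x^T) P_in^T.
    """
    return np.outer(P_out @ dy, P_in @ x).ravel()


def influence_scores(train_grads, test_grad, damping=1e-3):
    """Scores g_train^T (H + damping * I)^{-1} g_test in the projected space.

    H is approximated (an assumption of this sketch) by the empirical
    second moment of the projected training gradients.
    """
    G = np.stack(train_grads)                       # (n_train, k)
    H = G.T @ G / len(G) + damping * np.eye(G.shape[1])
    return G @ np.linalg.solve(H, test_grad)        # (n_train,)


# Toy data standing in for activations and backpropagated output gradients.
xs = rng.standard_normal((100, n_in))
dys = rng.standard_normal((100, n_out))
train_grads = [projected_grad(x, dy) for x, dy in zip(xs, dys)]
scores = influence_scores(train_grads, train_grads[0])
```

Here each projected gradient has 32 entries instead of the 2,048 entries of the full weight gradient, so the linear solve operates on a 32×32 matrix; the same trade of accuracy for cost is what makes projection attractive at LLM scale.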
Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers
Large language models have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries as well as irrelevant retrieved content. To address these limitations, we propose EXSEARCH, an agentic search framework in which the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To equip the LLM with this capability, EXSEARCH adopts a generalized expectation-maximization algorithm.
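The thinking, search, and recording steps described above can be sketched as a simple control loop. This is only an illustration of the loop structure, not the EXSEARCH implementation, and it does not cover the expectation-maximization training: the function `agentic_search`, the three callables, the `max_steps` limit, and the toy corpus are hypothetical stand-ins for an LLM and an external retriever.

```python
def agentic_search(question, think, search, record, max_steps=4):
    """Run a think -> search -> record loop and return the collected evidence."""
    evidence = []
    for _ in range(max_steps):
        query = think(question, evidence)      # decide what to retrieve next
        if query is None:                      # the "thinker" chooses to stop
            break
        docs = search(query)                   # call the external retriever
        evidence.append(record(query, docs))   # keep only fine-grained evidence
    return evidence


# Toy stand-ins (hypothetical): a fixed two-hop plan and a dictionary "retriever".
toy_corpus = {
    "hop-1 query": ["hop-1 evidence", "hop-1 distractor"],
    "hop-2 query": ["hop-2 evidence"],
}
toy_plan = ["hop-1 query", "hop-2 query"]


def toy_think(question, evidence):
    """Issue the next planned query, or None once the plan is exhausted."""
    return toy_plan[len(evidence)] if len(evidence) < len(toy_plan) else None


def toy_search(query):
    """Look the query up in the toy corpus."""
    return toy_corpus.get(query, [])


def toy_record(query, docs):
    """Keep the top document as the extracted evidence (empty string if none)."""
    return docs[0] if docs else ""


collected = agentic_search("toy multi-hop question", toy_think, toy_search, toy_record)
```

In a real system each callable would be a prompt to the same LLM or a call to a retrieval service; keeping them as plain function arguments makes the loop easy to test in isolation.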
An 80-Year-Old Math Problem Has Just Been Solved. You Might Not Like How We Got the Answer.
A.I.'s First Big Math Breakthrough Is Not What It Seems. But it can help us do genuinely creative work--for a reason you might not expect. Last month, OpenAI announced that its latest version of ChatGPT had solved a major math problem, one that had stumped experts for 80 years. This was considered among the most important unsolved problems in combinatorics, a prominent branch of math and computer science dealing with finite objects and arrangements. As opposed to previous A.I.-powered breakthroughs that involved back-and-forth conversations between a chatbot and a human expert, this was cracked with a single prompt.
Appendix A: More Details
Score 0 (normal) is most common across cohorts, while score 3 (severe) is rare--especially in PD-GaM and 3DGait, highlighting class imbalance challenges. BMCLab offers a balanced ON/OFF medication split, while E-LC is skewed toward ON-medication. DNE includes healthy, Parkinsonian, and other disease groups for broader contrastive training. Figure A.3 shows label distributions for FoG-related cohorts. This artifact likely stems from the unusual top-down perspective--different from the front-facing or side views seen in WHAM's training data [1]. While motion encoder-based models may be robust to such distortions, feature-based gait classifiers rely on precise kinematic measurements and thus require carefully corrected input data. To correct this slope artifact, we perform a frame-wise rigid alignment of the reconstructed SMPL skeleton using the Kabsch algorithm [2]. The goal is to rotate each frame so that anatomical directions align with canonical coordinate axes (up, forward), while preserving natural gait structure. This motion vector is then projected onto the ground plane (xz-plane) and used as the walking axis. In frames where the sacrum displacement is less than 4 mm--indicating near-stationary posture--we fall back on a proxy direction: the cross product of the hip vector (left hip to right hip) and the vertical vector.
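The two geometric steps mentioned above, a Kabsch rigid alignment and a walking axis with a near-stationary fallback, can be sketched as follows. This is a generic illustration rather than the authors' pipeline: the function names, the choice of y as the vertical axis, coordinates in metres (so 4 mm is `0.004`), and applying the threshold after projecting onto the xz-plane are all assumptions. The sign of the fallback direction depends on the coordinate convention, which the text does not specify.

```python
import numpy as np


def kabsch_rotation(P, Q):
    """Return the 3x3 rotation R minimizing sum ||R p_i - q_i||^2 after centering.

    P, Q: corresponding point sets, each of shape (N, 3).
    """
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                                # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T


def walking_axis(sacrum_prev, sacrum_next, left_hip, right_hip,
                 up=(0.0, 1.0, 0.0), min_disp=0.004):
    """Unit walking direction in the xz-plane, with a hip-based fallback.

    Uses the sacrum displacement projected onto the ground plane; if that
    is shorter than min_disp (assumed metres), falls back on the cross
    product of the hip vector (left hip to right hip) and the vertical.
    """
    disp = np.asarray(sacrum_next, float) - np.asarray(sacrum_prev, float)
    disp[1] = 0.0                                # project onto the xz-plane
    if np.linalg.norm(disp) < min_disp:
        hip = np.asarray(right_hip, float) - np.asarray(left_hip, float)
        disp = np.cross(hip, np.asarray(up, float))
        disp[1] = 0.0
    norm = np.linalg.norm(disp)
    return disp / norm if norm > 0 else disp
```

As a usage example, `kabsch_rotation(P, Q)` recovers a known rotation when `Q` is a rotated copy of `P`, and `walking_axis` switches to the hip-based proxy only when the projected sacrum displacement drops below the threshold.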
CARE-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson's Disease Gait Assessment
Objective gait assessment in Parkinson's Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce CARE-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. CARE-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson's Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on CARE-PD reduces MPJPE (from 60.8 mm to 7.5 mm) and boosts PD severity macro-F1 by 17 percentage points, underscoring the value of clinically curated, diverse training data. CARE-PD and all benchmark code are released for non-commercial research at https://neurips2025.care-pd.ca.
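For readers unfamiliar with the reconstruction metric, MPJPE (mean per-joint position error) is conventionally the Euclidean distance between predicted and reference joint positions, averaged over joints and frames. The sketch below shows that standard definition only; the function name and array layout are assumptions, and any alignment step or joint subset used in the CARE-PD benchmark is not specified here.

```python
import numpy as np


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: arrays of shape (frames, joints, 3) in the same units
    (for example millimetres). Returns the average Euclidean distance
    between corresponding joints.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(np.linalg.norm(pred - target, axis=-1).mean())


# Toy check: every joint is offset by a 3-4-0 vector, so each error has length 5.
toy_target = np.zeros((2, 3, 3))
toy_pred = toy_target + np.array([3.0, 4.0, 0.0])
toy_error = mpjpe(toy_pred, toy_target)
```

Because the metric is a plain average of distances, it inherits the units of the input coordinates, which is why the abstract reports it in millimetres.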